Report to BMM-based Chinese Word Segmentor with Context-based Unknown Word Identifier for the Second International Chinese Word Segmentation Bakeoff

نویسنده

  • Jia-Lin Tsai
چکیده

This paper describes a Chinese word segmentor (CWS) based on backward maximum matching (BMM) technique for the 2 nd Chinese Word Segmentation Bakeoff in the Microsoft Research (MSR) closed testing track. Our CWS comprises of a context-based Chinese unknown word identifier (UWI). All the context-based knowledge for the UWI is fully automatically generated by the MSR training corpus. According to the scored results of the MSR closed testing track and our analysis, it shows that our BMM-based CWS with the context-based UWI is a simple and effective system to achieve high Chinese word segmentation performance of more than 95.5% F-measure.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

BMM-Based Chinese Word Segmentor with Word Support Model for the SIGHAN Bakeoff 2006

This paper describes a Chinese word segmentor (CWS) for the third International Chinese Language Processing Bakeoff (SIGHAN Bakeoff 2006). We participate in the word segmentation task at the Microsoft Research (MSR) closed testing track. Our CWS is based on backward maximum matching with word support model (WSM) and contextual-based Chinese unknown word identification. From the scored results a...

متن کامل

The First International Chinese Word Segmentation Bakeoff

This paper presents the results from the ACL-SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff held in 2003 and reported in conjunction with the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan. We give the motivation for having an international segmentation contest (given that there have been two within-China contests to date) and we report on the resu...

متن کامل

SYSTRAN's Chinese Word Segmentation

SYSTRAN’s Chinese word segmentation is one important component of its Chinese-English machine translation system. The Chinese word segmentation module uses a rule-based approach, based on a large dictionary and fine-grained linguistic rules. It works on generalpurpose texts from different Chinesespeaking regions, with comparable performance. SYSTRAN participated in the four open tracks in the F...

متن کامل

Towards a Hybrid Model for Chinese Word Segmentation

This paper describes a hybrid Chinese word segmenter that is being developed as part of a larger Chinese unknown word resolution system. The segmenter consists of two components: a tagging component that uses the transformation-based learning algorithm to tag each character with its position in a word, and a merging component that transforms a tagged character sequence into a word-segmented sen...

متن کامل

Chinese Word Segmentation Based On Direct Maximum Entropy Model

Chinese word segmentation is a fundamental and important issue in Chinese information processing. In order to find a unified approach for Chinese word segmentation, the author develop a Chinese lexical analyzer PCWS using direct maximum entropy model. The paper presents the general description of PCWS, as well as the result and analysis of its performance at the Second International Chinese Wor...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005